In this exercise, we will be using functions from the tidyverse, performance, gtsummary, broom, and emmeans packages.
library(tidyverse)
library(performance)
library(gtsummary)
library(emmeans)
library(broom)
The file pinot.csv contains the results of an experiment by winemaker Vincent Lakey, comparing a standard herbicide with two greener alternatives: straw mulch and compost. The experiment used these three treatments in each of six different areas of the vineyard, recorded in the variable called block in the data set, pinot.csv.
For this exercise, we will fit a one-way ANOVA to the response variable Weight2001 (the weight of grapes harvested in 2001, in kg) using the explanatory variable Treatment.
1. Make a residual plot using check_model(). (To make the residual plot look nicer in the knitted output, you may want to change fig.width and fig.height; look at the solution for an example.)
2. Use joint_tests() to test the hypothesis that all treatments resulted in equal mean harvest weights.
3. Use emmeans() to produce a table of means and confidence intervals for the three treatments.
4. Use pairs() to produce a table of pairwise comparisons, with confidence intervals and p-values. (Note that this would not normally be reported, as the ANOVA result provides no reason to believe treatment means differ.)
5. Extension: produce a plot of the means of the three treatments, with error bars showing 95% confidence intervals.
pinot <- read_csv("pinot.csv")
m <- lm(Weight2001 ~ Treatment, data = pinot)
check_model(m)
joint_tests(m)
model term df1 df2 F.ratio p.value
Treatment    2  15   1.663  0.2226
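As a rough check on where the p-value in the joint_tests() output comes from, the sketch below recomputes it from the printed F ratio and degrees of freedom using base R's pf(). The inputs are the rounded values from the table above, so the result should agree with the printed p-value only to about three decimal places.

```r
# Values copied (rounded) from the joint_tests() output above
f_ratio <- 1.663
df1 <- 2    # three treatments minus 1
df2 <- 15   # 18 observations (3 treatments x 6 blocks) minus 3 treatment means

# Upper-tail probability of the F distribution
p_value <- pf(f_ratio, df1, df2, lower.tail = FALSE)
round(p_value, 3)
```

A p-value of this size gives little evidence against the hypothesis of equal treatment means, which is why the pairwise comparisons later in the solution would not normally be reported.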
emmeans(m, "Treatment")
Treatment emmean    SE df lower.CL upper.CL
compost     3.93 0.498 15     2.87     4.99
herbicide   3.27 0.498 15     2.21     4.33
straw       4.56 0.498 15     3.50     5.62
Confidence level used: 0.95
emmeans(m, "Treatment") %>%
  pairs(adjust = "none", infer = TRUE)
contrast            estimate    SE df lower.CL upper.CL t.ratio p.value
compost - herbicide    0.657 0.704 15   -0.843    2.157   0.933  0.3655
compost - straw       -0.627 0.704 15   -2.127    0.873  -0.891  0.3872
herbicide - straw     -1.283 0.704 15   -2.783    0.217  -1.824  0.0882
Confidence level used: 0.95
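To see how the unadjusted interval and p-value in a row of this table are built, the sketch below recomputes them for the compost - herbicide contrast from its estimate, standard error and degrees of freedom. It uses only base R and the rounded values printed above, so small discrepancies in the last digit are expected.

```r
# Values copied (rounded) from the compost - herbicide row above
estimate <- 0.657
se <- 0.704
df <- 15

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df)

# Unadjusted confidence interval: estimate +/- critical value * SE
ci <- estimate + c(-1, 1) * t_crit * se
round(ci, 2)

# Two-sided p-value from the t ratio
t_ratio <- estimate / se
p_value <- 2 * pt(abs(t_ratio), df, lower.tail = FALSE)
round(p_value, 2)
```

Because adjust = "none" was used, no multiplicity correction is applied; each interval is an ordinary t interval like this one.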
emmeans(m, "Treatment") %>%
  as_tibble() %>%
  mutate(Treatment = fct_reorder(Treatment, emmean)) %>%
  ggplot(aes(y = Treatment, x = emmean, xmin = lower.CL, xmax = upper.CL)) +
  geom_errorbar(width = 0.5) +
  geom_point() +
  labs(x = "Mean harvest in 2001 (kg)") +
  scale_x_continuous(limits = c(2, 6)) +
  theme_minimal() +
  theme(panel.grid.minor.x = element_blank(),
        panel.grid.major.y = element_blank())
Read in the file olympic_100m_results.csv and use filter() to select the results for women (Gender is W) gold medalists (Medal is G).
1. Fit a linear regression to result as a function of year.
2. Use model_performance() to obtain model summary statistics, e.g. R-squared.
3. Make a residual plot using check_model().
4. Obtain an estimate, confidence interval and p-value for the slope using tidy() or tbl_regression(). (Is the rounding appropriate? Look at the solutions for a way to control the number of decimal places.)
5. Plot the original data and use geom_smooth() to add a line of best fit.
6. Does this model appear to be a good fit to the data? Is it plausible for this relationship to continue into the future?
olympic_100m <- read_csv("olympic_100m_results.csv")
Rows: 138 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Gender, Event, Location, Medal, Name, Nationality
dbl (3): Year, Result, Time
lgl (1): Wind
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
olympic_100m_women_gold <- filter(olympic_100m, Gender == "W", Medal == "G")
m <- lm(Result ~ Year, data = olympic_100m_women_gold)
model_performance(m)
# Indices of model performance
AIC | AICc | BIC | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------
-4.118 | -2.518 | -1.285 | 0.807 | 0.795 | 0.185 | 0.196
check_model(m)
tbl_regression(m, estimate_fun = ~style_number(., digits = 3))
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| Year | -0.014 | -0.018, -0.011 | <0.001 |
¹ CI = Confidence Interval
tidy(m, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic       p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>         <dbl>    <dbl>     <dbl>
1 (Intercept)  39.4      3.35        11.8  0.00000000136  32.3      46.5
2 Year         -0.0143   0.00170     -8.42 0.000000181    -0.0179   -0.0107
ggplot(olympic_100m_women_gold,
       aes(x = Year, y = Result)) +
  geom_smooth(method = "lm") +
  geom_point()
`geom_smooth()` using formula = 'y ~ x'
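One way to explore whether the relationship could plausibly continue is to extrapolate the fitted line. The sketch below uses the rounded coefficients from the tidy() output (intercept 39.4, slope -0.0143), so the numbers are approximate. The helper name predicted_time is made up for illustration, and Result is assumed to be measured in seconds.

```r
# Rounded coefficients copied from the tidy() output above
intercept <- 39.4
slope <- -0.0143

# Hypothetical helper: winning time predicted by the fitted line
# (assumed to be in seconds)
predicted_time <- function(year) intercept + slope * year

# Predictions inside and well beyond the range of the data
predicted_time(c(2020, 2100, 2500))

# Year at which the fitted line would reach a winning time of zero
year_of_zero <- -intercept / slope
round(year_of_zero)
```

A straight line with a negative slope must eventually predict zero and then negative times, so the linear trend cannot hold indefinitely, however well it describes the observed years.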
Fit a linear model to both men’s and women’s 100m sprint gold medal times, with gender, year, and the interaction in the model.
1. Are the parameter estimates (from tbl_regression() or tidy()) easy to interpret?
2. Can you use emmeans() to obtain the mean for each gender?
3. This continuous-by-categorical interaction is best analysed using emmeans features we haven’t seen yet:
   - emmeans(m, "Gender", at = list(Year = 2020)) obtains estimated means for each gender in the year 2020.
   - emtrends(m, "Gender", "Year") obtains estimated slopes for each gender.
4. Use pairs() to obtain an estimate and confidence interval for the difference in slopes between genders.
olympic_100m_gold <- filter(olympic_100m, Medal == "G")
m <- lm(Result ~ Gender*Year, data = olympic_100m_gold)
check_model(m)
Variable `Component` is not in your data frame :/
tidy(m, conf.int = TRUE)
# A tibble: 4 × 7
  term         estimate std.error statistic  p.value conf.low conf.high
  <chr>           <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  35.0       2.26       15.5   6.09e-19 30.4      39.5
2 GenderW       4.42      4.40        1.00  3.21e- 1 -4.47     13.3
3 Year         -0.0126    0.00116   -10.9   8.05e-14 -0.0149   -0.0103
4 GenderW:Year -0.00169   0.00224    -0.757 4.53e- 1 -0.00621   0.00282
tbl_regression(m, estimate_fun = ~style_number(., digits = 3))
| Characteristic | Beta | 95% CI¹ | p-value |
|---|---|---|---|
| Gender | | | |
| M | — | — | |
| W | 4.421 | -4.467, 13.309 | 0.3 |
| Year | -0.013 | -0.015, -0.010 | <0.001 |
| Gender * Year | | | |
| W * Year | -0.002 | -0.006, 0.003 | 0.5 |
¹ CI = Confidence Interval
emmeans(m, "Gender", at = list(Year = 2020))
NOTE: Results may be misleading due to involvement in interactions
Gender emmean    SE df lower.CL upper.CL
M        9.54 0.084 42     9.37     9.71
W       10.54 0.104 42    10.33    10.75
Confidence level used: 0.95
emtrends(m, "Gender", "Year")
Gender Year.trend      SE df lower.CL upper.CL
M         -0.0126 0.00116 42  -0.0149  -0.0103
W         -0.0143 0.00192 42  -0.0182  -0.0104
Confidence level used: 0.95
emtrends(m, "Gender", "Year") %>%
  pairs(adjust = "none") %>%
  summary(infer = TRUE)
contrast estimate      SE df lower.CL upper.CL t.ratio p.value
M - W     0.00169 0.00224 42 -0.00282  0.00621   0.757  0.4534
Confidence level used: 0.95
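The emtrends() results can be tied back to the regression coefficients, which may help with the question of how easy those coefficients are to interpret. The sketch below uses the rounded estimates from the tidy() output, with M as the reference level of Gender: the Year coefficient is the slope for men, and the GenderW:Year coefficient is the additional slope for women.

```r
# Rounded coefficients copied from the tidy() output above
# (M is the reference level of Gender)
b_year <- -0.0126          # Year: slope for men
b_interaction <- -0.00169  # GenderW:Year: extra slope for women

slope_m <- b_year
slope_w <- b_year + b_interaction

# Compare with the Year.trend column of the emtrends() output
c(M = slope_m, W = slope_w)

# Compare with the M - W contrast from pairs(); it is the interaction
# coefficient with its sign reversed
slope_m - slope_w
```

The interaction term and the M - W contrast therefore carry the same information, which is why they share a standard error and p-value; only the sign of the estimate and of the interval limits differs.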
© 2021 Statistical Consulting Centre, The University of Melbourne.